(54) Highly available asynchronous I/O for clustered computer systems

(57) One embodiment of the present invention pro-
vides a system that allows an I/O request to proceed
when a primary server that is processing the I/O request 
fails, and a secondary server takes over for the primary 
server. Upon receiving an I/O request from an applica- 
tion running on a client, the system stores parameters 
for the I/O request on the client, and sends the I/O 
request to the primary server. Next, the system allows 
the application on the client to continue executing while 
the I/O request is being processed. If the primary server 
fails after the I/O request is sent to the primary server, 
but before an I/O request completion indicator returns 
from the primary server, the system retries the I/O 
request to the secondary server using the parameters 
stored on the client. The I/O request may originate from 
a number of different sources, including a file system 
access, an I/O request from a database system, and a 
paging request from a virtual memory system. In a vari- 
ation on the above embodiment, the act of storing the 
parameters for the I/O request on the client includes 
creating a distributed object defined within a distributed 
object-oriented programming environment, and sending 
a reference to the distributed object to the primary 
server to be stored on the primary server. This causes a 
distributed operating system to keep track of the refer- 
ence so that if the primary server fails, the reference 
count on the distributed object drops to zero and the dis- 
tributed operating system notifies the client that the dis- 
tributed object is unreferenced. This allows the client to 
deduce that the primary server has failed. 







EP 1 001 343 A1 






Description 
BACKGROUND 

Field of the Invention

[0001] The present invention generally relates to 
operating systems for fault-tolerant distributed comput- 
ing systems. More particularly, the present invention 
relates to a system and method that supports asynchro-
nous I/O requests that can switch to a secondary server
if a primary server for the I/O request fails. 

Related Art

[0002] As computer networks are increasingly used 
to link stand-alone computer systems together, distrib- 
uted operating systems have been developed to control 
interactions between multiple computer systems on a 
computer network. Distributed operating systems gen-
erally allow client computer systems to access
resources or services on server computer systems. For 
example, a client computer system may access infor- 
mation contained in a database on a server computer 
system. However, when the server fails, it is desirable
for the distributed operating system to automatically 
recover from this failure without the user client process 
being aware of the failure. Distributed computer sys- 
tems possessing the ability to recover from such server 
failures are referred to as "highly available systems,"
and data objects stored on such highly available sys-
tems are referred to as "highly available data objects."
[0003] To function properly, a highly available sys- 
tem must be able to detect a failure of a primary server 
and reconfigure itself so that accesses to objects on the
failed primary server are redirected to backup copies on 
a secondary server. This process of switching over to a 
backup copy on the secondary server is referred to as a 
"failover." 

[0004] Asynchronous I/O requests are particularly
hard to implement in highly available systems. Asyn- 
chronous I/O requests allow a process to initiate an I/O 
request and continue processing while the I/O request 
is in progress. In this way, the process continues doing 
useful work - instead of blocking - while the I/O request
is in progress, thereby increasing system performance. 
Unfortunately, a process typically has little control over 
when the I/O request completes. This lack of control 
over the timing of I/O requests can create problems in 
highly available systems, which must be able to recover
from primary server failures that can occur at any time 
while an asynchronous I/O request is in progress. 
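
As a minimal illustration of the asynchronous pattern described above, the following Python sketch starts a slow read on a worker thread and lets the caller keep working until the result is ready. The names `slow_read` and `run_demo` and the 50 ms delay are assumptions made for this example; they do not come from the described system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple


def slow_read(block_id: int) -> bytes:
    """Hypothetical stand-in for a device read that takes a while."""
    time.sleep(0.05)
    return f"block-{block_id}".encode()


def run_demo() -> Tuple[int, bytes]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_read, 7)  # initiate the I/O request
        units_of_work = 0
        while not future.done():            # the caller is not blocked on the I/O
            units_of_work += 1              # placeholder for useful work
            time.sleep(0.001)
        return units_of_work, future.result()
```

The caller decides what to do in the meantime, but not when the read finishes, which is the timing problem the paragraph above points out.
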
[0005] What is needed is a highly available system 
that supports asynchronous I/O requests that can 
switch to a secondary server if a primary server for the
I/O request fails. 



SUMMARY 

[0006] One embodiment of the present invention 
provides a system that allows an I/O request to proceed 
when a primary server that is processing the I/O request
fails, and a secondary server takes over for the primary 
server. Upon receiving an I/O request from an applica- 
tion running on a client, the system stores parameters 
for the I/O request on the client, and sends the I/O
request to the primary server. Next, the system allows 
the application on the client to continue executing while 
the I/O request is being processed. If the primary server
fails after the I/O request is sent to the primary server, 
but before an I/O request completion indicator returns 
from the primary server, the system retries the I/O 
request to the secondary server using the parameters 
stored on the client. The I/O request may originate from 
a number of different sources, including a file system 
access, an I/O request from a database system, and a 
paging request from a virtual memory system. In a vari- 
ation on the above embodiment, the act of storing the 
parameters for the I/O request on the client includes 
creating a distributed object defined within a distributed 
object-oriented programming environment, and sending 
a reference to the distributed object to the primary 
server to be stored on the primary server. This causes a 
distributed operating system to keep track of the refer-
ence so that if the primary server fails, the reference
count on the distributed object drops to zero and the dis- 
tributed operating system notifies the client that the dis- 
tributed object is unreferenced. This allows the client to 
deduce that the primary server has failed. 
[0007] One aspect of the above embodiment 
involves locking a page of memory associated with the 
I/O request before sending the I/O request to the pri- 
mary server. This ensures that the page remains 
unmodified in case the I/O request needs to be retried. 
The page is ultimately unlocked when the I/O request 
completes and the primary server informs the client of 
the completion of the I/O request. 
[0008] Still other embodiments of the present inven- 
tion will become readily apparent to those skilled in the 
art from the following detailed description, wherein is 
shown and described only the embodiments for the 
invention by way of illustration of the best modes con- 
templated for carrying out the invention. As will be real- 
ized, the invention is capable of other and different 
embodiments and several of its details are capable of 
modifications in various obvious respects, all without 
departing from the spirit and scope of the present inven- 
tion. Accordingly, the drawings and detailed description 
are to be regarded as illustrative in nature and not as 
restrictive. 
DESCRIPTION OF THE FIGURES
[0009] 

FIG. 1 illustrates a distributed computing system in
accordance with an embodiment of the present 
invention. 

FIG. 2 illustrates functional components within a cli- 
ent and a primary server involved in implementing 
highly available asynchronous I/O operations in
accordance with an embodiment of the present 
invention. 

FIG. 3 illustrates part of the internal structure of a 
callback object in accordance with an embodiment 
of the present invention.

FIG. 4 is a flow chart illustrating some of the opera- 
tions involved in performing an asynchronous I/O 
operation in accordance with an embodiment of the 
present invention. 

FIG. 5 is a flow chart illustrating some of the opera-
tions involved in recovering from a failure during an
asynchronous I/O operation in accordance with an 
embodiment of the present invention. 

DETAILED DESCRIPTION OF THE INVENTION

Description of Distributed System 

[0010] FIG. 1 illustrates a distributed computing
system in accordance with an embodiment of the
present invention. This distributed computing system
includes network 110, which is coupled to client 102,
primary server 106 and secondary server 108. Network
110 generally refers to any type of wired or wireless link
between computers, including, but not limited to, a local
area network, a wide area network, or a combination of
networks. In general, a client, such as client 102, refers
to an entity that requests a resource or a service. Corre-
spondingly, a server, such as primary server 106 or sec-
ondary server 108, services requests for resources and
services. In certain cases, the client and server for an
object may exist on the same computing node. In other
cases, the client and server for an object exist on differ-
ent computing nodes. In the example illustrated in FIG.
1, client 102, primary server 106 and secondary
server 108 exist on separate computing nodes.
[0011] Primary server 106 and secondary server 
108 are coupled to storage device 112. Storage device 
112 includes non-volatile storage for data to be 
accessed by primary server 106 and secondary server
108. Although FIG. 1 illustrates direct communication 
links from storage device 112 to primary server 106 and
to secondary server 108, these communication links 
may actually be implemented by messages across net-
work 110, or another independent network.
[0012] In one embodiment of the present invention,
the operating system for the distributed computing sys-
tem illustrated in FIG. 1 is the Solaris MC operat-
ing system, which is a product of Sun Microsystems,
Inc. of Palo Alto, California. The Solaris MC operating 
system is a UNIX-based operating system. Hence, in 
describing the present technology, UNIX terminology 
and concepts are frequently used. However, this usage 
is for purposes of illustration and is not to be construed 
as limiting the invention to this particular operating sys- 
tem. 

[0013] In the illustrated embodiment, client 102 
includes highly-available asynchronous I/O system 104, 
which may be implemented as a library and which facil- 
itates asynchronous I/O operations that can be directed 
to a secondary server 108 when a primary server 106 
that is processing the I/O request fails. For example, 
assume client 102 has an outstanding I/O request to pri- 
mary server 106. If primary server 106 fails while this 
I/O request is outstanding, client 102 is eventually noti- 
fied by the distributed operating system that primary 
server 106 failed. This causes client 102 to retry the 
failed I/O request on secondary server 108. Note that 
secondary server 108 is capable of processing the 
same I/O request because it also can access storage 
device 112. 
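
The failover behavior described in this paragraph can be sketched as follows. This is a simplified model under stated assumptions, not the Solaris MC implementation: the `Server` and `Client` classes, the `on_primary_failed` hook and the use of a shared dictionary to stand in for storage device 112 are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Server:
    name: str
    storage: Dict[str, bytes]                       # shared storage device
    queue: List[Tuple[str, bytes]] = field(default_factory=list)
    alive: bool = True

    def accept(self, key: str, data: bytes) -> None:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        self.queue.append((key, data))              # I/O started, not yet done

    def complete_all(self) -> None:
        for key, data in self.queue:
            self.storage[key] = data
        self.queue.clear()

    def fail(self) -> None:
        self.alive = False
        self.queue.clear()                          # in-flight work is lost


@dataclass
class Client:
    primary: Server
    secondary: Server
    outstanding: Dict[int, Tuple[str, bytes]] = field(default_factory=dict)

    def write_async(self, request_id: int, key: str, data: bytes) -> None:
        self.outstanding[request_id] = (key, data)  # keep parameters locally
        self.primary.accept(key, data)

    def on_primary_failed(self) -> None:
        # Invoked by the (assumed) cluster layer. For brevity the retry
        # completes synchronously here; in the text it is asynchronous.
        for key, data in self.outstanding.values():
            self.secondary.accept(key, data)
        self.secondary.complete_all()
        self.outstanding.clear()


def demo_failover() -> Dict[str, bytes]:
    disk: Dict[str, bytes] = {}
    primary, secondary = Server("primary", disk), Server("secondary", disk)
    client = Client(primary, secondary)
    client.write_async(1, "/data/a", b"hello")
    primary.fail()                # fails while request 1 is outstanding
    client.on_primary_failed()    # notification triggers the retry
    return disk
```

Because both servers write to the same `disk` dictionary, the retried request lands in the same place the original would have, which mirrors the role of the shared storage device.
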

Functional Components Involved in Asynchronous I/O

[0014] FIG. 2 illustrates functional components 
within client 102 and primary server 106, which are 
involved in implementing highly available asynchronous 
I/O operations in accordance with an embodiment of the 
present invention. In the illustrated embodiment, client 
102 from FIG. 1 contains user process 204. User proc- 
ess 204 makes an I/O request by calling an I/O function 
from system library 208. System library 208 includes 
functions that communicate with proxy file system 
(PXFS) 212.

[0015] User process 204 may include any process 
in client 102 that is capable of generating an I/O request 
to primary server 106. This includes, but is not limited to,
a user process that generates a file system reference, a 
database process that generates a database access, 
and a paging system that generates a page reference. 
Although FIG. 2 presents a "user" process for illustrative 
purposes, in general, any "user" or "system" process 
may generate the I/O request. 
[0016] System library 208 includes a collection of 
functions that implement various system calls, including 
functions that carry out I/O requests. An I/O routine 
within system library 208 typically converts a user-level 
system call from user process 204 into a kernel-level 
system call to perform the I/O operation. 
[0017] Proxy file system (PXFS) 212 is part of a 
highly available distributed file system that supports 
failovers from a primary server 106 to a secondary 
server 108 when primary server 106 fails. In the illus- 
trated embodiment, PXFS 212 includes callback object 
214, which contains information related to an associ-
ated asynchronous I/O operation.
[0018] PXFS 212 within client 102 communicates 
with PXFS 222 within primary server 106. Note that 
PXFS 212 within client 102 and PXFS 222 within pri- 
mary server 106 are different parts of the same distrib- 
uted file system. PXFS 222 includes distributed object 
pointer 223, which is a reference to a callback object 
214 within client 102. In fact, distributed object pointer 
223 creates a distributed reference 225 to callback 
object 214. If primary server 106 fails, distributed refer- 
ence 225 disappears. This causes the count of active 
references to callback object 214 to drop to zero, which 
causes the distributed operating system to notify PXFS 
212 on client 102 that callback object 214 is unrefer-
enced. Since primary server 106 is the only entity hold-
ing a reference to callback object 214, client 102 can
conclude that primary server 106 has failed. 
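
The reference-counting mechanism in this paragraph can be modeled in a few lines. The sketch below is an illustration only: `TrackedObject`, `add_reference` and `drop_reference` are hypothetical names, and a real distributed operating system would detect a dead node itself rather than being told through a method call.

```python
from typing import Callable, List, Set


class TrackedObject:
    """Client-side object whose remote holders are tracked (simplified model)."""

    def __init__(self, on_unreferenced: Callable[[], None]) -> None:
        self._holders: Set[str] = set()
        self._on_unreferenced = on_unreferenced

    def add_reference(self, holder: str) -> None:
        self._holders.add(holder)

    def drop_reference(self, holder: str) -> None:
        # Called when the holder releases its reference, or when the
        # (assumed) cluster layer detects that the holder's node has died.
        if holder in self._holders:
            self._holders.remove(holder)
            if not self._holders:
                self._on_unreferenced()


def demo_unreferenced() -> List[str]:
    events: List[str] = []
    obj = TrackedObject(lambda: events.append("unreferenced"))
    obj.add_reference("primary-106")    # primary stores the reference
    obj.drop_reference("primary-106")   # primary fails; its reference vanishes
    return events
```

Since the primary is the only holder, a single dropped reference is enough to trigger the notification.
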
[0019] Primary server 106 additionally contains 
storage device driver 224, which receives commands
from PXFS 222 and, in response to these commands,
accesses storage device 112.
[0020] Storage device driver 224 communicates
with storage device 112 in order to perform specified I/O
operations to storage device 112.
[0021] FIG. 3 illustrates part of the internal structure
of a callback object 214 in accordance with an embodi- 
ment of the present invention. The embodiment illus- 
trated in FIG. 3 includes pointer to I/O request 302 and 
I/O request status 304. Pointer to I/O request 302 is a 
pointer to an I/O request object containing information 
related to the pending I/O request. This pointer allows 
the pending I/O request to be retried on secondary 
server 108 when primary server 106 fails. Although the 
embodiment of callback object 214 illustrated in FIG. 3 
takes the form of an object defined within an object-ori- 
ented programming system, in general, any type of data 
structure that stores equivalent data items may be used. 
Note that client 102 creates and maintains a separate 
callback object 214 for each pending I/O request. 
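
One possible shape for the data described above is sketched below. The two fields of `CallbackObject` correspond to items 302 and 304 of FIG. 3; the fields of `IORequest`, the `pending` table and the `register` helper are assumptions added for illustration, since the text does not list the request parameters.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class IOStatus(Enum):
    IN_PROGRESS = "in progress"
    DONE = "done"


@dataclass
class IORequest:
    """Hypothetical request parameters; the source does not enumerate them."""
    operation: str   # for example "read" or "write"
    path: str
    offset: int
    length: int


@dataclass
class CallbackObject:
    request: IORequest                        # "pointer to I/O request 302"
    status: IOStatus = IOStatus.IN_PROGRESS   # "I/O request status 304"


# One callback object per pending request, keyed by an assumed request id.
pending: Dict[int, CallbackObject] = {}


def register(request_id: int, request: IORequest) -> CallbackObject:
    callback = CallbackObject(request)
    pending[request_id] = callback
    return callback
```

Keeping the full request reachable from the callback object is what makes a later retry possible without involving the user process.
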

Asynchronous I/O Process

[0022] FIG. 4 is a flow chart illustrating some of the 
operations involved in performing an asynchronous I/O 
operation in accordance with an embodiment of the 
present invention. These operations are described with 
reference to the functional components illustrated in 
FIG. 2. First, user process 204 (from FIG. 2) makes an 
I/O request (step 402). As mentioned previously, this I/O 
request may include any type of I/O request to storage 
device 112 (from FIG. 2), including a file system access,
an I/O request from a database system, or a paging 
request from a virtual memory system. Next, I/O 
request 206 executes a system call from within system 
library 208 (step 404). This system call generates a ker- 
nel system call 210, which accesses proxy file system 
(PXFS) 212 on client 102. 

[0023] In processing the I/O request, PXFS 212
creates and stores callback object 214 (step 406). In
doing so, PXFS 212 initializes pointer to I/O request 302
from FIG. 3 to point to an associated I/O request object
(in order to allow the I/O request to be retried), and
sets I/O request status 304 to "in progress" (step
408).

[0024] Next, PXFS 212 makes an invocation to pri-
mary server 106 (step 410). This invocation includes a
reference to callback object 214. In response to the
invocation, PXFS 222 in primary server 106 stores the
reference to callback object 214 as distributed object
pointer 223. This causes a distributed operating system
to keep track of a distributed reference 225 to callback
object 214. If primary server 106 fails, the distributed
operating system notifies client 102 that callback object
214 is unreferenced. Since primary server 106 is the
only entity holding a reference to callback object 214,
client 102 can conclude that primary server 106 has
failed; client 102 can then take appropriate action.

[0025] Next, PXFS 222 on primary server 106 calls
storage device driver 224 to start an I/O operation (step
412). Storage device driver 224 initiates the I/O opera-
tion by sending a command to storage device 112. At
this point the invocation returns to client 102 (step 414).
PXFS 212 on client 102 then forwards the return
to user process 204, which allows user process 204 to
continue processing (step 416). User process 204 can
thereby perform useful work instead of waiting for the
I/O operation to complete. In the meantime, the I/O
operation continues processing and completes at some
undetermined time in the future.
[0026] When the I/O request is completed by stor-
age device 112, storage device 112 sends an interrupt
to storage device driver 224. In response to the inter-
rupt, storage device driver 224 calls an I/O done function
(step 420). This function causes primary server 106 to
notify client 102 that the I/O request is complete (step
422). Furthermore, if the I/O request was for a read
operation, data read from storage device 112 is passed
back from primary server 106 to client 102 at this time.
Next, PXFS 212 on client 102 sets I/O request status
304 to "done," and unlocks any pages that were locked
during the I/O request. During an I/O operation, the
pages remain locked until the I/O operation completes
to prevent the pages from being swapped out or deleted
while the I/O is in progress. Note that the pages are
locked at some time before the I/O operation is sent to
primary server 106 in step 410. Also note that the inter-
rupt may complete (in step 420) at any time after the I/O
request starts in step 412, because the I/O request is
executing on a separate thread. Hence, step 420 may
follow any of states 412, 414 and 416 as is indicated by
the dashed lines.

[0027] Next, primary server 106 releases distrib-
uted reference 225 to callback object 214 (step 426).
Since there is only one reference to callback object 214,
when distributed reference 225 is released, callback
object 214 is unreferenced. The distributed operating
system will eventually detect this fact and client 102 will
receive an "unreferenced" message on callback object
214 (step 428). In response to this unreferenced mes-
sage, client 102 examines I/O request status 304 within
callback object 214. If I/O request status 304 indicates the
I/O request is complete, client 102 finally deletes call-
back object 214 (step 430). At this point, the I/O opera-
tion is complete.
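
The client-side bookkeeping for completion and cleanup might look like the following sketch. It assumes a hypothetical `ClientState` class with page numbers modeled as integers, and it records a retry instead of performing one; it also ignores the case of two requests sharing a page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PendingIO:
    pages: List[int]
    status: str = "in progress"


@dataclass
class ClientState:
    locked_pages: Set[int] = field(default_factory=set)
    callbacks: Dict[int, PendingIO] = field(default_factory=dict)
    retried: List[int] = field(default_factory=list)

    def start(self, request_id: int, pages: List[int]) -> None:
        self.locked_pages.update(pages)   # lock before sending (before step 410)
        self.callbacks[request_id] = PendingIO(list(pages))

    def on_completion(self, request_id: int) -> None:
        # Completion notice from the primary (step 422).
        entry = self.callbacks[request_id]
        entry.status = "done"
        self.locked_pages.difference_update(entry.pages)

    def on_unreferenced(self, request_id: int) -> None:
        # Unreferenced message from the operating system (step 428).
        entry = self.callbacks[request_id]
        if entry.status == "done":
            del self.callbacks[request_id]    # step 430
        else:
            self.retried.append(request_id)   # placeholder for a retry
```

The same unreferenced message therefore means "finished, clean up" or "server died, retry", and the stored status is what distinguishes the two.
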



Failure Recovery 






[0028] FIG. 5 is a flow chart illustrating some of the
operations involved in recovering from a failure during
an asynchronous I/O operation in accordance with an
embodiment of the present invention. If primary server
106 fails before the I/O invocation is made to primary
server 106 (in step 410 of FIG. 4), the invocation is auto-
matically retried to secondary server 108 from FIG. 1 by
the distributed operating system (step 504).
[0029] Next, if primary server 106 fails after the I/O
invocation to primary server 106 is made in step 410,
but before the invocation returns to client 102 in step
414, the replica framework of the distributed operating
system performs a retry of the I/O request to secondary
server 108 (step 502). In the illustrated embodiment, the
replica framework is part of the Solaris MC operating
system produced by Sun Microsystems, Inc. of Palo
Alto, CA. However, other operating systems can use
analogous mechanisms to perform the retry. Note that
before the invocation reaches primary server 106, dis-
tributed reference 225 to callback object 214 may not
exist. Hence, client 102 cannot rely on receiving an
unreferenced notification in case primary server 106
dies.

[0030] Next, if primary server 106 fails after the
invocation returns to client 102 (in step 414), but before
primary server 106 notifies client 102 of completion of
the I/O operation in step 422, a series of events takes
place (step 506). When primary server 106 dies, the ref-
erence count maintained by the distributed operating
system for callback object 214 drops to zero. This is
detected by the distributed operating system, and client
102 receives an unreferenced message on callback
object 214 from the distributed operating system. In
response to this unreferenced message, PXFS 212 in
client 102 checks to make certain that I/O request status
304 within callback object 214 is set to "in progress." If
so, PXFS 212 retries the I/O request to secondary
server 108 using the original request structure indexed
by pointer to I/O request 302 within callback object 214.
Note that user process 204 is not aware of this failure.
[0031] If primary server 106 fails after primary 
server 106 notifies client 102 of completion of the I/O 
request, client 102 sets I/O request status 304 to
"done" (step 508). Primary server 106 additionally
cleans up the data structures associated with the I/O 
request. This may include unlocking pages involved in 
the I/O request and deleting callback object 214. 
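
The four failure windows described in this section can be summarized as a small decision function. This is only a sketch of the case analysis: the `FailurePoint` enumeration and the returned strings are hypothetical labels, with the window boundaries taken from steps 410, 414 and 422 of FIG. 4.

```python
from enum import Enum


class FailurePoint(Enum):
    BEFORE_INVOCATION = "before step 410"
    DURING_INVOCATION = "after step 410, before step 414"
    AFTER_RETURN = "after step 414, before step 422"
    AFTER_COMPLETION = "after step 422"


def recovery_action(point: FailurePoint, status: str = "in progress") -> str:
    if point in (FailurePoint.BEFORE_INVOCATION, FailurePoint.DURING_INVOCATION):
        # No distributed reference may exist yet, so the client cannot rely on
        # an unreferenced message; the operating system retries the invocation.
        return "operating system retries invocation on secondary"
    if point is FailurePoint.AFTER_RETURN:
        # The unreferenced message arrives; retry only if still in progress.
        if status == "in progress":
            return "client retries saved request on secondary"
        return "no retry needed"
    return "mark request done and clean up"
```

For example, `recovery_action(FailurePoint.AFTER_RETURN)` selects the client-driven retry, which is the only case that depends on the callback object's stored status.
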



CONCLUSION 

[0032] Thus, the present invention supports highly- 
available asynchronous I/O requests that can switch to 
a secondary server if a primary server for the I/O 
request fails. 

[0033] While the invention has been particularly 
shown and described with reference to embodiments 
thereof, those skilled in the art will understand that the 
foregoing and other changes in form and detail may be 
made therein without departing from the scope of the 
present invention. 

Claims 

1. A method that allows an I/O request (206) to pro- 
ceed when a primary server (106) that is process- 
ing the I/O request fails, and a secondary server 
(108) takes over for the primary server, the method
comprising: 

receiving (402) the I/O request from a client 
application (204) running on a client (102); 
storing (406) parameters for the I/O request on 
the client; 

sending (410) the I/O request to the primary
server;

allowing (416) the client application to continue 
executing while the I/O request is being proc- 
essed; and 

if the primary server fails after the I/O request is 
sent to the primary server but before an I/O 
request completion indicator returns from the 
primary server, retrying the I/O request to the 
secondary server using the parameters for the 
I/O request stored on the client (504,505). 

2. The method of claim 1, wherein the act of retrying
the I/O request to the secondary server (108) takes 
place without the client application (204) being 
aware of the failure of the primary server (106). 

3. The method of claim 1 or claim 2, wherein the act of 
storing the parameters for the I/O request on the cli- 
ent includes: 

creating a distributed object (223) defined 
within a distributed object-oriented program- 
ming environment; and 

sending a reference to the distributed object 
(225) to be stored on the primary server caus- 
ing a distributed operating system to keep track 
of the reference; 

wherein if the primary server fails, the distrib- 
uted operating system notifies the client that 
the distributed object is unreferenced, allowing 
the client to conclude that the primary server 
has failed. 

4. The method of any one of claims 1 to 3, further
comprising if the primary server fails before the I/O 
request completion indicator is sent to the primary 
server, sending the I/O request to the secondary 
server (502).

5. The method of any one of claims 1 to 4, wherein the 
act of storing parameters for the I/O request 
includes storing a status indicator for the I/O 
request on the client, the status indicator indicating
whether the I/O request is in progress or complete. 

6. The method of claim 5, further comprising if the pri- 
mary server completes and returns the I/O request 
completion indicator to the client, setting the status
indicator to a value indicating that the I/O request is 
complete (508). 

7. The method of any one of claims 1 to 6, further 
comprising if the primary server completes and
returns the I/O request completion indicator to the 
client, removing (430) the parameters for the I/O 
request previously stored on the client. 

8. The method of any one of claims 1 to 7, further
comprising: 

locking a page of memory associated with the 
I/O request before sending the I/O request to 
the primary server; and
unlocking (424) the page of memory if the pri- 
mary server completes and returns the I/O 
request completion indicator to the client. 

9. The method of any one of claims 1 to 8, wherein the
I/O request is one of a file system access, an I/O
request from a database system, and a paging 
request from a virtual memory system. 

10. An apparatus that allows an I/O request (206) to
proceed when a primary server (106) that is 
processing the I/O request fails, and a secondary 
server (108) takes over for the primary server, com-
prising:

an I/O processing mechanism that is config- 
ured to receive the I/O request from a client 
application (204) running on a client, store 
parameters for the I/O request on the client, 
and send the I/O request to the primary server;
an execution mechanism (400) that allows the 
client application to continue executing while 
the I/O request is being processed; and 
a retrying mechanism (500) that retries the I/O 
request to the secondary server using the
parameters for the I/O request stored on the cli- 
ent if the primary server fails after the I/O 
request is sent to the primary server but before
an I/O request completion indicator returns
from the primary server. 

11. The apparatus of claim 10, wherein the retrying
mechanism (500) is configured to retry the I/O 
request to the secondary server without the client 
application being aware of the failure of the primary 
server. 

12. The apparatus of claim 10 or claim 11, wherein 
upon receiving the I/O request the I/O processing 
mechanism is configured to: 

create a distributed object (223) defined within 
a distributed object-oriented programming 
environment; and 

send a reference to the distributed object (225) 
to be stored on the primary server causing a 
distributed operating system to keep track of 
the reference; 

wherein if the primary server fails, the distrib- 
uted operating system notifies the client that 
the distributed object is unreferenced, allowing 
the client to conclude that the primary server 
has failed. 

13. The apparatus of any one of claims 10 to 12, 
wherein the retrying mechanism (500) is configured 
to send the I/O request to the secondary server if 
the primary server fails before the I/O request is 
sent to the primary server (502). 

14. The apparatus of any one of claims 10 to 13, 
wherein the I/O processing mechanism (400) is 
configured to store (406) a status indicator for the 
I/O request on the client, the status indicator indi- 
cating whether the I/O request is in progress or 
complete. 

15. The apparatus of any one of claims 10 to 14, further
comprising a page locking mechanism that is con- 
figured to lock a page of memory associated with 
the I/O request before sending the I/O request to 
the primary server, and to unlock (424) the page of 
memory if the primary server completes and 
returns the I/O request completion indicator to the 
client. 

16. A computer readable storage medium storing 
instructions that when executed by a computer 
cause a computer to perform a method that allows 
an I/O request (206) to proceed when a primary 
server (106) that is processing the I/O request fails, 
and a secondary server (108) takes over for the pri- 
mary server, the method comprising: 

receiving (402) the I/O request from a client 
application (204) running on a client; 
storing (406) parameters for the I/O request on
the client; 

sending (410) the I/O request to the primary 
server; 

allowing (416) the client application to continue
executing while the I/O request is being proc- 
essed; and 

if the primary server fails after the I/O request is
sent to the primary server but before an I/O
request completion indicator returns from the
primary server, retrying the I/O request to the 
secondary server using the parameters for the 
I/O request stored on the client (504,505). 

17. A computer program encoding a set of computer
instructions for allowing an I/O request (206) to pro- 
ceed when a primary server (106) that is process- 
ing the I/O request fails, and a secondary server 
(108) takes over for the primary server, which when 
running on a computer or a computer network is
adapted to perform the method as claimed in any 
one of claims 1 to 9. 